A Tiered CRF Tagger for Polish
Abstract
In this paper we present a new approach to morphosyntactic tagging of Polish that combines Conditional Random Fields with tiered tagging. Our proposal also makes it possible to exploit a rich set of morphological features, which rely on an external morphological analyser. The proposed algorithm is implemented as a tagger for Polish. Evaluation of the tagger shows a significant improvement in tagging accuracy over two state-of-the-art taggers, namely PANTERA and WMBT.
Similar resources
TLT-CRF: A Lexicon-supported Morphological Tagger for Latin Based on Conditional Random Fields
We present a morphological tagger for Latin, called TTLab Latin Tagger based on Conditional Random Fields (TLT-CRF) which uses a large Latin lexicon. Beyond Part of Speech (PoS), TLT-CRF tags eight inflectional categories of verbs, adjectives or nouns. It utilizes a statistical model based on CRFs together with a rule interpreter that addresses scenarios of sparse training data. We present resu...
Trigram morphosyntactic tagger for Polish
We introduce an implementation of a plain trigram part-of-speech tagger which appears to work well on Polish texts. At this moment the tagger achieves 9.4% error rate, which makes it significantly better than our previous stochastic disambiguator. Since the trigram model for Polish behaves similarly to Czech, we hope to reach Czech state-of-art error rate when the quality of the training data im...
Integrating high dimensional bi-directional parsing models for gene mention tagging
Motivation: Tagging gene and gene product mentions in scientific text is an important initial step of literature mining. In this article, we describe in detail our gene mention tagger, which participated in the BioCreative 2 challenge, and analyze what contributes to its good performance. Our tagger is based on the conditional random fields model (CRF), the most prevailing method for the gene mention taggin...
Optimisation of Polish Tagger Parameters
The large tagset of the IPI PAN Corpus of Polish enforced a modular architecture of the Polish tagger called TaKIPI. The architecture introduces several parameters, for learning and tagging, that are difficult to adjust properly by hand. In this paper a method of optimising the parameter values based on a Genetic Algorithm is presented. A chromosome is a set of values, a specimen is a ...
Evaluating the Impact of External Lexical Resources into a CRF-based Multiword Segmenter and Part-of-Speech Tagger
Abstract: This paper evaluates the impact of external lexical resources into a CRF-based joint Multiword Segmenter and Part-of-Speech Tagger. We especially show different ways of integrating lexicon-based features in the tagging model. We display an absolute gain of 0.5% in terms of f-measure. Moreover, we show that the integration of lexicon-based features significantly compensates the use of a s...